Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up configuration resolution #2905

Merged
merged 10 commits into from
May 10, 2024
Merged

Conversation

s-ludwig
Copy link
Member

@s-ludwig s-ludwig commented May 3, 2024

This speeds up the getPackageConfigs call issued by getDefaultConfiguration from 14.8 ms to 3.6 ms for the particular project that I'm benchmarking. There is still a lot more to gain, but this is a good start.

s-ludwig added 4 commits May 3, 2024 12:10
…ageConfigs.

Package.name and Package.getAllDependencies both need to reallocate their return value for every call, which results in heavy GC pressure during the configuration resolution algorithm.
Avoids reallocating the edges array every time the configuration resolution eliminates a configuration.
@s-ludwig s-ludwig requested a review from atilaneves May 3, 2024 12:42
Copy link

github-actions bot commented May 3, 2024

✅ PR OK, no changes in deprecations or warnings

Total deprecations: 0

Total warnings: 0

Build statistics:

 statistics (-before, +after)
-executable size=5364248 bin/dub
-rough build time=63s
+executable size=5376664 bin/dub
+rough build time=64s
Full build output
DUB version 1.36.0, built on Mar  3 2024
LDC - the LLVM D compiler (1.37.0):
  based on DMD v2.107.1 and LLVM 17.0.6
  built with LDC - the LLVM D compiler (1.37.0)
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver3
  http://dlang.org - http://wiki.dlang.org/LDC


  Registered Targets:
    aarch64     - AArch64 (little endian)
    aarch64_32  - AArch64 (little endian ILP32)
    aarch64_be  - AArch64 (big endian)
    amdgcn      - AMD GCN GPUs
    arm         - ARM
    arm64       - ARM64 (little endian)
    arm64_32    - ARM64 (little endian ILP32)
    armeb       - ARM (big endian)
    avr         - Atmel AVR Microcontroller
    bpf         - BPF (host endian)
    bpfeb       - BPF (big endian)
    bpfel       - BPF (little endian)
    hexagon     - Hexagon
    lanai       - Lanai
    loongarch32 - 32-bit LoongArch
    loongarch64 - 64-bit LoongArch
    mips        - MIPS (32-bit big endian)
    mips64      - MIPS (64-bit big endian)
    mips64el    - MIPS (64-bit little endian)
    mipsel      - MIPS (32-bit little endian)
    msp430      - MSP430 [experimental]
    nvptx       - NVIDIA PTX 32-bit
    nvptx64     - NVIDIA PTX 64-bit
    ppc32       - PowerPC 32
    ppc32le     - PowerPC 32 LE
    ppc64       - PowerPC 64
    ppc64le     - PowerPC 64 LE
    r600        - AMD GPUs HD2XXX-HD6XXX
    riscv32     - 32-bit RISC-V
    riscv64     - 64-bit RISC-V
    sparc       - Sparc
    sparcel     - Sparc LE
    sparcv9     - Sparc V9
    spirv32     - SPIR-V 32-bit
    spirv64     - SPIR-V 64-bit
    systemz     - SystemZ
    thumb       - Thumb
    thumbeb     - Thumb (big endian)
    ve          - VE
    wasm32      - WebAssembly 32-bit
    wasm64      - WebAssembly 64-bit
    x86         - 32-bit X86: Pentium-Pro and above
    x86-64      - 64-bit X86: EM64T and AMD64
    xcore       - XCore
   Upgrading project in /home/runner/work/dub/dub/
    Starting Performing "release" build using /opt/hostedtoolcache/dc/ldc2-1.37.0/x64/ldc2-1.37.0-linux-x86_64/bin/ldc2 for x86_64.
    Building dub 1.37.0-rc.1+commit.41.gbfdf300a: building configuration [application]
     Linking dub
STAT:statistics (-before, +after)
STAT:executable size=5376664 bin/dub
STAT:rough build time=64s

@s-ludwig s-ludwig force-pushed the speed_up_config_resolution branch from d0bc073 to c40ca1e Compare May 3, 2024 12:46
… packages.

Avoids costly AA lookups and string comparisons.
@s-ludwig s-ludwig force-pushed the speed_up_config_resolution branch from c40ca1e to c620806 Compare May 3, 2024 13:06
@s-ludwig s-ludwig force-pushed the speed_up_config_resolution branch from c620806 to 449d412 Compare May 3, 2024 13:22
s-ludwig added 2 commits May 3, 2024 15:35
Uses a set to determine redundant edge entries instead of iterating through the list.
@s-ludwig
Copy link
Member Author

s-ludwig commented May 3, 2024

Added two more optimizations to get down to 3.1 ms.

@s-ludwig
Copy link
Member Author

s-ludwig commented May 3, 2024

Now down to 2.6 ms.

}).array;
// eliminate all edges that connect to config 'i'
size_t eti = 0;
foreach (ei, e; edges) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this for loop is small enough to justify variable names such as eti, ei, and e.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop contains 8 lines of code. What's your personal threshold?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop contains 8 lines of code. What's your personal threshold?

For one letter variables? 1 line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, a handful of lines is okay, as long variable names are not only making the code easier to understand, but also harder to read - although non-standard variables like the ones here are more critical than the standard ones, if used correctly (i, j, k , x, y, z). However, the parameter i to removeConfig actually violates my own style guide and should be a full word instead, which would free up the i for the foreach loop.

But IMO, what would really be the right way to go here would be to generalize the array filter algorithm, so that this becomes edges = edges.filterInPlace!((e) => ...) (analogous to std.algorithm.mutation.remove). Unfortunately, Phobos doesn't have that AFAIK and I didn't want to mix non-local refactoring (or any substantial refactoring for that matter) into the same PR that does optimization.

source/dub/project.d Outdated Show resolved Hide resolved
source/dub/project.d Outdated Show resolved Hide resolved
source/dub/project.d Show resolved Hide resolved
@atilaneves
Copy link
Contributor

I have a concern about code complexity. I found the original function hard to understand; it took me a couple of hours of summarising it in English before I understood what it did well enough to be able to optimise it. This PR makes even more complicated.

Having said that, I tried this PR on the pathological case that motivated my own PR and the runtime for getPackageConfigs went from ~290ms to ~90ms. The "problem" is that it's still death by a thousand cuts. Packages with fewer dependencies don't take that long to begin with so our total runtime went from 15s to 12s.

@s-ludwig
Copy link
Member Author

s-ludwig commented May 7, 2024

I have a concern about code complexity. I found the original function hard to understand; it took me a couple of hours of summarising it in English before I understood what it did well enough to be able to optimise it. This PR makes even more complicated.

I agree with that. Part of it is based on the unfortunate design choice of making Package.recipe and Package.rawRecipe mutable references, which is something that would be nice to remedy at some point - and then ultimately reducing the complexity here again. Other parts could be improved by introducing new container types (Set!T, ListSet!T), but in the end this is an algorithmic complexity issue and not a code style issue. Also, the code can be refactored for more clarity for sure, but that is a different topic, as that would target the high-level structure (which is not changed by this PR).

Having said that, I tried this PR on the pathological case that motivated my own PR and the runtime for getPackageConfigs went from ~290ms to ~90ms. The "problem" is that it's still death by a thousand cuts. Packages with fewer dependencies don't take that long to begin with so our total runtime went from 15s to 12s.

Without being able to reproduce your pathological case or more precise information, I cannot help you there. You identified this particular piece of code to be responsible for a considerable slowdown, so I went head and optimized it. But apart from that, a 20% total improvement is not what I would associate with "death by a thousand cuts".

@atilaneves
Copy link
Contributor

You identified this particular piece of code to be responsible for a considerable slowdown, so I went head and optimized it

Yes, and thanks!

a 20% total improvement is not what I would associate with "death by a thousand cuts".

Normally, no, but the majority of the runtime is still in getPackageConfigs. It gets called twice to make things worse, once to add the test runner configuration and another in the generator when getting the build information (so with different arguments). Our build has 80 dub packages, which means death by 160 cuts. If each "cut" is ~50ms on average it means an added runtime of 8s.

@s-ludwig
Copy link
Member Author

Normally, no, but the majority of the runtime is still in getPackageConfigs. It gets called twice to make things worse, once to add the test runner configuration and another in the generator when getting the build information (so with different arguments). Our build has 80 dub packages, which means death by 160 cuts. If each "cut" is ~50ms on average it means an added runtime of 8s.

But no matter how often getPackageConfigs gets called, a 2x improvement stays a 2x improvement. The death by a thousand cuts would IMO apply more to a situation where a thousand different places each contribute a thousandth of the runtime.

Just to quickly get the numbers straight here:

  • The overall number for you improved by roughly 25% - assuming that getPackageConfigs takes up 50% of the total time after the optimization, that would mean a speedup for getPackageConfigs of only 1.5x
  • Assuming the 160 calls per total runtime is true, that means with the optimized version a single call takes about 38 ms on average (15x more than for my benchmark project)
  • Your pathological test case improved by roughly 3.2x
  • In my test case, I'm getting an improvement of around 5.7x
  • Applying the 3.2x factor above to estimate the overall runtime taken by getPackageConfigs would result in just 29% of the total runtime of 15 s (11% afterwards for the 12 s)

What seems odd here is that, speaking of the pathological test case, you both seem to have a larger payload (assuming the single-thread performance of the machine is not much lower) and a considerably worse speedup at the same time. I would usually expect the opposite to be true and it would be interesting to find out why that happens.

On the other hand, if I should bet, I would say that the reality is that the majority of the runtime is actually spent somewhere else. Taking the estimation from the last bullet point, this would mean that optimizing getPackageConfigs alone will never result in less than maybe 11 s. Did you measure absolute time, or is this a statistical measurement from a sampling profiler (which can be heavily skewed)?

@atilaneves atilaneves merged commit a87d3cc into master May 10, 2024
31 checks passed
@ibuclaw ibuclaw deleted the speed_up_config_resolution branch May 10, 2024 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants